Overview of Data

The purpose of this dataset was to help find predictors of diabetes. The goal is to help medical professionals identify patients with potential risk factors of diabetes.The dataset contains information on patient demographics such as age and gender, as well as medical information including blood glucose levels and hypertension. The observations in this dataset were obtained from research studies and healthcare institutions. The dataset was obtained via Kaggle.

Mustafa, T. Z. (2021). Diabetes Prediction Dataset. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

The dataset consists of 8 predictor variables for diabetes: ‘gender’ the patient’s gender as male or female. ‘age’ the age of the patient in years. ‘hypertension’ yes or no, on whether the patient has hypertension. ‘heart disease’ yes or no, on whether the patient has heart disease. ‘smoking history’ the patient’s smoking history indicated as never, no info, current, former or not current. ‘bmi’ the patient’s body mass index (kg/m**2). ‘HbA1c level’ the patient’s average blood sugar level over the past 2-3 months (%). ‘blood glucose level’ amount of blood sugar in the patient’s blood (mg/dL).

#load in dataset
data <-read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')
diabetesd<-data

#change heart disease, diabetes and hypertension to 1 and 0 for easier use in MICE.
data$diabetes <- recode(data$diabetes, "1"="Yes", "0"="No")
data$hypertension <- recode(data$hypertension, "1"="Yes", "0"="No")
data$heart_disease <- recode(data$heart_disease, "1"="Yes", "0"="No")

#plot interaction between diabetes and hypertension.
ggplot(data, aes(x = hypertension, fill = diabetes)) +
  geom_bar(position = "dodge") + # Use "stack" for a stacked bar chart
  labs (title="Risk of Diabetes by Hypertension", x="Hypertension", y= "Frequency",
        fill= "Diabetes")
The plot of shows the frequency of patients with diabetes and hypertension. The plot indiactes a higher frequency of patients not having diabetes and not having hypertension. There does not seem to be an interaction between having diabetes and having hypertension based on the low frequency of patients having both.

The plot of shows the frequency of patients with diabetes and hypertension. The plot indiactes a higher frequency of patients not having diabetes and not having hypertension. There does not seem to be an interaction between having diabetes and having hypertension based on the low frequency of patients having both.

Comparison of Two Categorical Variables

The purpose of this graph is to show the relationship between having hypertension and diabetes. For this graph, I changed the values of 1 in both the diabetes and hypertension variable to “Yes” and the value 0 to “No.” These integers were meant to represent categorical variables with 1 correlating to being positive for hypertension or diabetes and 0 correlating to being negative to hypertension or diabetes. Based on our graph, we can see that there is no correlation between having diabetes and having hypertension, with most patients within our sample having neither diabetes or hypertension. Thus, hypertension would not be a good predictor of diabetes within patients.

interact = ggplot(data, aes(x =blood_glucose_level , y = HbA1c_level, group = diabetes, color = diabetes)) +
  geom_line(size = 1) +  # Lines representing interaction
  geom_point(size = 2) + # Points for data
  labs(
    title = "Interaction of Blood Glucose and HbA1c Levels",
    x = "Blood Glucose Level",
    y = "HbA1c Level",
    color = "Presence of Diabetes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text = element_text(size = 12),
    legend.position = "top"
  )
interact
The plot shows the interaction between blood glucose levels and HbA1c levels on diabetes. Based on the plot, higher blood glucose levels and HbA1c levels are positively correlated with the presence of diabetes. In comparison, lower levels, indicate that a patient will more commonly not have diabetes. The height of the points shown for diabetes represent the levels of HbA1c, based on the plot, HbA1c levels predict the presence of diabetes more often than blood glucose levels.

The plot shows the interaction between blood glucose levels and HbA1c levels on diabetes. Based on the plot, higher blood glucose levels and HbA1c levels are positively correlated with the presence of diabetes. In comparison, lower levels, indicate that a patient will more commonly not have diabetes. The height of the points shown for diabetes represent the levels of HbA1c, based on the plot, HbA1c levels predict the presence of diabetes more often than blood glucose levels.

Comparison of Two Numerical Variables

The two variables being compared here are Blood Glucose Level and HbA1c Level. HbA1cc is a shortened term for hemoglobin A1c. Hb1Ac level is a measure of average blood sugar levels over 2-3 months and is measured in %. Blood glucose level is the amount of blood sugar in a patients blood at a given time. Our plot shows a correlation between high levels in HbA1c and blood glucose levels and it’s relationship with diabetes. This plot suggests patients with high levels of HbA1c and blood glucose levels are at high risk for diabetes.

#plot of a numerical and categorical variable
 boxplot(data$age ~ data$diabetes,
         xlab="Patient Age",
         ylab="Presence of Diabetes",
         col = c("skyblue", "purple"),
         main="Visualization of Diabetes by Age",
         cex.main = 1.1,
         col.main = "navy")
The boxplot shows the relationship between age and presence of diabetes. The boxplot shows that older patients on average are more often have diabetes. There are a few outliers in the diabetes population that appear to be under the age of 20 years. This could be representative of patients with Type I diabetes, since the data does not distinguish between Type I and Type II diabetes.

The boxplot shows the relationship between age and presence of diabetes. The boxplot shows that older patients on average are more often have diabetes. There are a few outliers in the diabetes population that appear to be under the age of 20 years. This could be representative of patients with Type I diabetes, since the data does not distinguish between Type I and Type II diabetes.

Comparison of a Numerical and Categorical Variables

This chart shows the relationship between age and diabetes diagnosis. This chart shows that as patients grow older they are at higher risk for diabetes. This would make age a predictor for diabetes, with the average age for those without diabetes being 40 years and 60 years for those with diabetes. However, we can see that we do have a few outliers for those with diabetes with some patients being diagnosed at 20 or younger. This could be from the population of people with Type I diabetes which is usually diagnosed in patients under 20. Our dataset does not distinguish between type I and type II diabetes.

Conclusion for Comparison of Variables

Diabetes can be predicted using age, blood glucose levels, and HbA1c levels in patients. However, our results showed hypertension is not a good predictor of diabetes. Based on our findings, it would be interesting to look at factors that occur once a patient becomes older that could lead to diabetes at an older age. As well, diet could be another factor to look into, such as grams of sugar consumed on a daily or weekly basis, to see if sugar consumption can predict diabetes. Another important thing to analyze in future studies would be Type I versus Type II diabetes, as it could be the reason for some of the outliers in the boxplot.

Overview of Missing Variables

An important step in data analysis for machine learning is handling missing values. Missing values in a dataset can occur for many reasons, but not limited to non-response in surveys, participants leaving the study, data entry errors and system limitations. If missing values are not addressed, models can become biased and accuracy of the anaylysis can decrease. This section examines different imputation strategies to effectively handle missing values.

The imputation methods we will examine include:

Replacement Imputation for Categorical Features: Mode imputation and KNN-based imputation.

Regression-based Imputation for Numerical Features: Predictive modeling to estimate missing values.

Multiple Imputation: Advanced methods such as MICE to improve robustness.

# create random observation ID and replace the corresponding obs with missing
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE) 
diabetesd$bmi[ltr.missing.id] <- NA
diabetesd$heart_disease[gpa.missing.id] <- NA
diabetesd$smoking_history[diabetesd$smoking_history == "No Info"] <- NA

Mode Imputation for Categorical Variables

In this method, missing variables are replaced with the most frequent category of the corresponding variable. Mode imputation ensures consistency and works well when missing values are randomly distributed. Below, mode imputation is used on the variable, heart disease, which is 1 if the patient has heart disease and 0 if the patient does not have heart disease.The variable smoking history had current, former, no info, never, and ever as possible values. No Info indicates that we do not have information on the participant’s smoking history and will be treated as a missing values.

diabetesd <-diabetesd
# Function to impute mode

Mode <- function(x) {
  ux <- unique(na.omit(x))  # Remove NAs before computing mode
  tab <- tabulate(match(x, ux))
  mode_value <- ux[which.max(tab)]
  
  # Ensure mode_value is of the same type as x
  if (is.factor(x)) {
    return(factor(mode_value, levels = levels(x)))
  } else if (is.character(x)) {
    return(as.character(mode_value))
  } else {
    return(as.numeric(mode_value))
  }
}
#apply mode imputation
value_imputed <- data.frame(
  original = diabetesd$heart_disease,
  original2=diabetesd$smoking_history,
  imputed_modehd = replace(diabetesd$heart_disease, is.na(diabetesd$heart_disease), Mode((diabetesd$heart_disease))),
  imputed_modesh = replace(diabetesd$smoking_history, is.na(diabetesd$smoking_history), Mode(diabetesd$smoking_history)))
summary(value_imputed)
    original        original2         imputed_modehd    imputed_modesh    
 Min.   :0.00000   Length:100000      Min.   :0.00000   Length:100000     
 1st Qu.:0.00000   Class :character   1st Qu.:0.00000   Class :character  
 Median :0.00000   Mode  :character   Median :0.00000   Mode  :character  
 Mean   :0.03943                      Mean   :0.03942                     
 3rd Qu.:0.00000                      3rd Qu.:0.00000                     
 Max.   :1.00000                      Max.   :1.00000                     
 NA's   :15                                                               
#plot the comparison of original data and mode imputed data
h1 <- ggplot(value_imputed, aes(x = original)) +
  geom_histogram(bins=10, fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original for Heart Disease") +
  theme_classic()
h2 <- ggplot(value_imputed, aes(x = original2)) +
  geom_bar( fill = "#15ad4f", color = "#000000", position = "identity") +
  ggtitle("Original for Smoking History") +
  theme_classic()
h3 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
  geom_histogram(bins=10, fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed for Heart Disease") +
  theme_classic()
h4 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
  geom_bar( fill = "#ad8415", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed for Smoking History") +
  theme_classic()


plot_grid(h1, h2, h3, h4, nrow = 2, ncol = 2)
<img src=“First-Project_files/figure-html/unnamed-chunk-5-1.png” alt=“The plot shows the comparison between the original data containing missing values for the variables heart disease and smoking history, and the data after using mode impute. The histograms show that mode impute used what was most common to replace the missing values. This is seen clearly in the variable smoking history, where”never” went from originally a count of about 40,000 to about 70,000. Heart disease had a small amount of missing data (only 15), however the 15 missing were changed to 0, indicating the patient would not have heart disease.” width=“672” />

The plot shows the comparison between the original data containing missing values for the variables heart disease and smoking history, and the data after using mode impute. The histograms show that mode impute used what was most common to replace the missing values. This is seen clearly in the variable smoking history, where “never” went from originally a count of about 40,000 to about 70,000. Heart disease had a small amount of missing data (only 15), however the 15 missing were changed to 0, indicating the patient would not have heart disease.

Regression-Based Imputation for Numerical Variables

Using regression models, we can estimate missing values. For the variable, bmi, we can predict the missing values using estimates from our model. Bmi is a continuous variable, unlike the previous example where our values could only be 0 or 1. In this method, a linear model was created to help create estimates of our values.

#create a dataset for regression impute
regimpute<-diabetesd

# Identify rows where BMI is missing
missing_rows <- which(is.na(diabetesd$bmi))  # Get row indices


# Train a linear model using complete cases
lm_model <- lm(bmi ~ age + blood_glucose_level + HbA1c_level + heart_disease + hypertension + diabetes, 
               data = regimpute, na.action = na.exclude)

# Impute missing BMI values using the model
diabetesd$bmi[missing_rows] <- predict(lm_model, newdata = regimpute[missing_rows, ])


dep.mi <- bind_rows(
  data.frame(value = diabetesd$bmi, imputed = regimpute$bmi))
             #imp = "dep.imp1")


i7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
The boxplots show the comparison of the variable bmi before and after regression imputation.

The boxplots show the comparison of the variable bmi before and after regression imputation.

i8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)
The boxplots show the comparison of the variable bmi before and after regression imputation.

The boxplots show the comparison of the variable bmi before and after regression imputation.

MICE Imputation

MICE is multiple imputation techniques used to impute missing values. This method also uses regression models and accounts for uncertainty by generating multiple possible values. Using MICE, the variables smoking history, heart disease, and bmi, can be imputated within one function. MICE can handle different types of variables, imputates based on the relationship between variables and it accounts for variability in missing data.

#reload dataset so that it is free of mode and regression imputation done previously
data2 <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')

#create missing values and make smoking history a binary variable
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE) 
data2$smoking_history[data2$smoking_history== "No Info"] <-NA
data2$bmi[ltr.missing.id] <- NA
data2$heart_disease[gpa.missing.id] <- NA

#make variables numeric
data2$gender <- as.numeric(as.factor(data2$gender))
data2$heart_disease<- as.factor(as.numeric(data2$heart_disease))
data2$smoking_history<-as.factor(data2$smoking_history)
#apply MICE imputation
df_mice <- mice(data2, method = c("","","","pmm","polyreg","pmm","","",""),  m = 20, maxit=20,seed = 123, print=F)

# Complete dataset with imputed values
df_imputed <- complete(df_mice)

#plot comparisons of before and after MICE imputation for heart disease
g1 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
  geom_bar(fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed") +
  labs(x="Heart Disease")+
  theme_classic()
g2 <- ggplot(value_imputed, aes(x = original)) +
  geom_bar(fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original") +
  labs(x="Heart Disease")+
  theme_classic()
g3 <- ggplot(df_imputed, aes(x = heart_disease)) +
  geom_bar(fill = "purple", color = "#000000", position = "identity") +
  ggtitle("MICE") +
  labs(x="Heart Disease")+
  theme_classic()
plot_grid(g1, g2, g3, nrow=1, ncol =3)
Based on the three graphs, MICE seems to have produced values that appear almost identical to the original dataset. Mode impute seems to be slightly different since it relies on filling missing values with the most common value. In this case, mode impute filled all missing values as 0 and under-represented 1.

Based on the three graphs, MICE seems to have produced values that appear almost identical to the original dataset. Mode impute seems to be slightly different since it relies on filling missing values with the most common value. In this case, mode impute filled all missing values as 0 and under-represented 1.

#plot comparisons for smoking history
g4 <- ggplot(df_imputed, aes(x = smoking_history)) +
  geom_bar( fill = "purple", color = "#000000", position = "identity") +
  ggtitle("MICE") +
  labs(x="Smoking History")+
  theme_classic()
g5 <- ggplot(value_imputed, aes(x = original2)) +
  geom_bar( fill = "red", color = "#000000", position = "identity") +
  ggtitle("Original Smoking History") +
  labs(x="Smoking History")+
  theme_classic()
g6 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
  geom_bar( fill = "blue", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed") +
  labs(x="Smoking History")+
  theme_classic()
plot_grid(g4,g5,g6, nrow=3, ncol=1)
<img src=“First-Project_files/figure-html/unnamed-chunk-8-1.png” alt=“The bar graphs show the comparison of the original data to mode imputation and MICE. MICE appears to have dispersed the data more evenly across the different categories, while mode imputation has only added values to”never.” width=“672” />

The bar graphs show the comparison of the original data to mode imputation and MICE. MICE appears to have dispersed the data more evenly across the different categories, while mode imputation has only added values to “never.

#plot comparisons for bmi
g7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
The boxplots show the comparison of the original data to both MICE and regression imputation.

The boxplots show the comparison of the original data to both MICE and regression imputation.

g8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)
The boxplots show the comparison of the original data to both MICE and regression imputation.

The boxplots show the comparison of the original data to both MICE and regression imputation.

g9 <-boxplot(df_imputed$bmi,
main = "MICE for BMI",
xlab = "BMI",
col = "purple",
border = "black",
horizontal = TRUE,
notch = TRUE
)
The boxplots show the comparison of the original data to both MICE and regression imputation.

The boxplots show the comparison of the original data to both MICE and regression imputation.

model5 <- with(df_mice, lm(diabetes ~ bmi+ heart_disease+ smoking_history ))
summary(pool(model5))
                        term      estimate    std.error    statistic         df
1                (Intercept) -1.588938e-01 0.0042264348 -37.59524385  8765.6364
2                        bmi  8.387455e-03 0.0001291007  64.96833640 95903.5486
3             heart_disease1  2.213973e-01 0.0044025602  50.28829659 97099.1099
4        smoking_historyever  4.374612e-03 0.0044673218   0.97924717   768.3395
5      smoking_historyformer  3.923135e-02 0.0034242201  11.45701814  1528.3749
6       smoking_historynever -3.066871e-05 0.0026838513  -0.01142713   847.0962
7 smoking_historynot current  6.032198e-03 0.0038465169   1.56822334   577.2461
        p.value
1 6.484235e-287
2  0.000000e+00
3  0.000000e+00
4  3.277661e-01
5  3.251856e-29
6  9.908854e-01
7  1.173770e-01

Skewness

Skewness is a measure of symmetry within a distribution. The skewness measurements below are of the variables age, bmi, HbA1c level, and blood glucose level, respectively. When skewness is equal to 0, we can assume the data is normally distributed. When skewness is greater than 0, our data is positively skewed, while a value less than zero indicates negatively skewed data. For the diabetes dataset, the variables bmi and blood glucose level are 1 or about 1, which indicate a significant right skew in the distribution.

age1<- skewness(df_imputed$age) 
bmi1<-skewness(df_imputed$bmi) 
HbA1c1<-skewness( df_imputed$HbA1c_level) 
glucose1<-skewness(df_imputed$blood_glucose_level)

mytable<-data.frame(age1,bmi1,HbA1c1, glucose1)
kable(mytable, caption= "Skewness of Numeric Variables")
Skewness of Numeric Variables
age1 bmi1 HbA1c1 glucose1
-0.0519782 1.041976 -0.0668528 0.8216426

Transforming Data

To handle our skewed variables, we can normalize the variables bmi and blood glucose level. Normalization rescales data into a fixed range of [0,1]. By normalizing our data, we can fix skewness and create normally distributed data. The variables, age and HbA1c level have been standardized to keep the distribution centered. Standardization is used to create a mean of 0 and a standard deviation of 1, to follow a normal distribution. Based on the results below, we can see that the transformation has corrected the right skew in the data from before.

df_trans <- df_imputed  # Copy original dataset after MICE

#ensure diabetes is numeric and a factor
df_trans$diabetes<- as.numeric(as.factor(df_trans$diabetes))
# Min-Max Normalization
normalize <- function(x) {
  if (max(x) - min(x) == 0) return(rep(0, length(x)))  # Prevent division by zero
  return((x - min(x)) / (max(x) - min(x)))
}

#apply normalization
df_trans$bmi <- normalize(df_imputed$bmi)
df_trans$blood_glucose_level <- normalize(df_imputed$blood_glucose_level)

# Z-Score Standardization
standardize <- function(x) {
  return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}

#apply standardization
df_trans$age <- standardize(df_trans$age)
df_trans$HbA1c_level <- standardize(df_trans$HbA1c_level)

# Apply Log Transformation to Fix Skewness (if necessary)
df_trans$bmi <- log1p(df_trans$bmi)  # log(1 + x) avoids log(0) issues
df_trans$blood_glucose_level <- log1p(df_trans$blood_glucose_level)

bmi2 <- skewness(df_trans$bmi)
age2 <- skewness(df_trans$age)
HbA1c2<- skewness(df_trans$HbA1c_level)
glucose2<- skewness(df_trans$blood_glucose_level)

newtable<-data.frame(bmi2,age2,HbA1c2,glucose2)
kable(newtable, caption="Newly Transformed Data")
Newly Transformed Data
bmi2 age2 HbA1c2 glucose2
0.673108 -0.0519782 -0.0668528 0.2867457

Transforming our data has successfully decreased skewness in our variables. We can see bmi and blood glucose level are no longer about 1.

# Combine old and new data for comparison
df_compare <- data.frame(
  bmi_original = df_imputed$bmi,
  bmi_transformed = df_trans$bmi,
  blood_glucose_original = df_imputed$blood_glucose_level,
  blood_glucose_transformed = df_trans$blood_glucose_level
)

# Reshape data for ggplot
df_long <- melt(df_compare)

# Plot histograms
ggplot(df_long, aes(x = value, fill = variable)) +
  geom_histogram(alpha = 0.6, bins = 30, position = "identity") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(title = "Comparison of Original vs. Transformed Data", x = "Value", y = "Count")
We can see in the graphs above that after transforming the dataset, the data seems to follow a normal distribution more closely.This is apparent in the histograms appearing more in a bell shape curve.

We can see in the graphs above that after transforming the dataset, the data seems to follow a normal distribution more closely.This is apparent in the histograms appearing more in a bell shape curve.

Analyzing Data

The next step is to find which variables are significantly important when trying to predict diabetes. By using feature selection, we are able to predict possible best fit models for our variable, diabetes. For this data, we will focus on logistic regression. The variables of interest, diabetes, is a binary variable indicating linear regression would not make an ideal model.

Random Forest First, we will look at the best model selected by feature selection through random forest:

# Prepare the data
df_trans$diabetes <- as.factor(df_trans$diabetes)


# Set up RFE control with random forest
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# Apply RFE to identify top features
results <- rfe(
  x = df_trans[, !colnames(df_trans) %in% "diabetes"],
  y = df_trans$diabetes,
  sizes = c(1:9),
  rfeControl = control
)

# Print the selected features
models<-data.frame(predictors(results))
kable(models, caption="Random Forest Selected Model")
Random Forest Selected Model
predictors.results.
HbA1c_level
blood_glucose_level
bmi
heart_disease
age
hypertension
gender
smoking_history

Stepwise Logistic Regression Next, let’s look and stepwise logistic regression

#Stepwise Logistic Regression
# Fit a full logistic regression model with all predictors
full_model <- glm(diabetes ~ ., data = df_trans, family = binomial)

# Perform stepwise selection (default is backward elimination)
step_model1 <- step(full_model, direction = "both")
Start:  AIC=23047.57
diabetes ~ gender + age + hypertension + heart_disease + smoking_history + 
    bmi + HbA1c_level + blood_glucose_level

                      Df Deviance   AIC
- smoking_history      4    23029 23045
<none>                      23024 23048
- gender               1    23078 23100
- heart_disease        1    23176 23198
- hypertension         1    23308 23330
- bmi                  1    24361 24383
- age                  1    25096 25118
- blood_glucose_level  1    30572 30594
- HbA1c_level          1    34856 34878

Step:  AIC=23044.52
diabetes ~ gender + age + hypertension + heart_disease + bmi + 
    HbA1c_level + blood_glucose_level

                      Df Deviance   AIC
<none>                      23029 23045
+ smoking_history      4    23024 23048
- gender               1    23087 23101
- heart_disease        1    23184 23198
- hypertension         1    23314 23328
- bmi                  1    24369 24383
- age                  1    25193 25207
- blood_glucose_level  1    30582 30596
- HbA1c_level          1    34865 34879
summary(step_model1)

Call:
glm(formula = diabetes ~ gender + age + hypertension + heart_disease + 
    bmi + HbA1c_level + blood_glucose_level, family = binomial, 
    data = df_trans)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -9.99142    0.11057 -90.360  < 2e-16 ***
gender               0.27068    0.03536   7.654 1.94e-14 ***
age                  1.03388    0.02428  42.583  < 2e-16 ***
hypertension         0.80228    0.04665  17.198  < 2e-16 ***
heart_disease1       0.76240    0.06014  12.677  < 2e-16 ***
bmi                 10.23780    0.28095  36.439  < 2e-16 ***
HbA1c_level          2.49236    0.03769  66.122  < 2e-16 ***
blood_glucose_level 10.29954    0.14902  69.113  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 58163  on 99999  degrees of freedom
Residual deviance: 23029  on 99992  degrees of freedom
AIC: 23045

Number of Fisher Scoring iterations: 8

Both selection types selected the same model! However I will still go through the steps of cross-validation as though I was comparing the two models.

Cross-Validation

Let’s explore cross-validation to see which model is best fit. For logistic regression, it is meaningful to look at accuracy and kappa.

#add classification
df_trans$diabetes<- factor(ifelse(df_trans$diabetes==1, "Yes","No"))

train_control <- trainControl(
  method = "cv",         # cross-validation
  number = 10,           # 10-fold
  classProbs = TRUE # use if you're tracking AUC, Sensitivity, etc.
)

#run CV model
cv_model <- train(
  diabetes ~ hypertension + gender +age + HbA1c_level + blood_glucose_level + heart_disease + bmi, 
  data=df_trans, 
  method = "glm", 
  family = "binomial", 
  trControl = train_control
  )

print(cv_model)
Generalized Linear Model 

1e+05 samples
7e+00 predictors
2e+00 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 90000, 90000, 90000, 90000, 90000, 90000, ... 
Resampling results:

  Accuracy  Kappa    
  0.95935   0.7015624

Accuracy shows the overall correctness of the model predictions and kappa measures the levels of agreement between variables. Values closer to 1 suggest higher accuracy and better agreement. This model suggests high accuracy and high agreement based on the values.

ROC Curve and AOC Curve

Confusion Matrices ROC is also known as Receiver Operating Characteristic Analysis. It is a technique using graphs that evaluates the performance of a binary model. Before plotting the ROC curve, it is necessary to calculate the true positive rate (TPR) and false positive rate (FPR). Below, are confusion matrices that were used to calculate those values.

#ROC and AUC
df_trans$diabetes <- as.factor(df_trans$diabetes)

# fit a logistic
model.logit <- glm(diabetes ~ age + gender+ bmi+hypertension + heart_disease + blood_glucose_level+ HbA1c_level , family = binomial, data = df_trans)
# predict probability of P(Y = "Yes")
probabilities <- round(as.vector(predict(model.logit, type = "response")),3)
#
thresholds <- c(0.0, 0.25, 0.5, 0.75, 1.0)

# Loop through thresholds and create confusion matrices
for (threshold in thresholds) {
  cat("\nConfusion Matrix for Threshold =", threshold, "\n")
  
  # Convert probabilities to predictions
  # am: 1 = manual transmission, 0 = automatic transmission
  predictions <- ifelse(probabilities > threshold, "Yes", "No")
    all_levels <- union(levels(factor(df_trans$diabetes)), levels(factor(predictions)))
    # Generate confusion matrix
  cm <- confusionMatrix(factor(predictions, levels=all_levels), factor(df_trans$diabetes,levels=all_levels), positive = "Yes")
  print(cm$table)
}

Confusion Matrix for Threshold = 0 
          Reference
Prediction    No   Yes
       No    354     0
       Yes  8146 91500

Confusion Matrix for Threshold = 0.25 
          Reference
Prediction    No   Yes
       No   4216   146
       Yes  4284 91354

Confusion Matrix for Threshold = 0.5 
          Reference
Prediction    No   Yes
       No   5307   867
       Yes  3193 90633

Confusion Matrix for Threshold = 0.75 
          Reference
Prediction    No   Yes
       No   6322  3340
       Yes  2178 88160

Confusion Matrix for Threshold = 1 
          Reference
Prediction    No   Yes
       No   8500 91500
       Yes     0     0

ROC Curve

#create ROC curve

TPR = c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0))
FPR = c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0))
plot(FPR, TPR, type = "b", main = "An Illustrative ROC Curve", col ="blue",
     xlab="1 - Specifity (FPR)", ylab = "Sensitivity (TPR)")
# add a off-diagonal representing random guess algorithm in binary prediction
abline(0,1, lty = 2, col = "red")
# legend
legend("bottomright", c("Logistic Model", "Random Guess"),
       col=c("blue", "red"), lty = 1:2, bty="n", cex = 0.9)

AUC (Area under the Curve) AUC quantifies the performance of the ROC curve into a single value that ranges from 0 to 1. The plot below uses the Riemann Sum approximation to estimate the area under the curve.

#AUC
TPR = round(c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0)),3)
FPR = round(c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0)),3)
TPR0 = TPR[7:1]
FPR0 = FPR[7:1]
#plot(FPR0, TPR0, type = "b")
datSenSpe = data.frame(TPR0, FPR0)
ggROC = ggplot(data = datSenSpe, aes(x = FPR0, y=TPR0)) +
        geom_line(col = "steelblue") +
        geom_point(col = "red") +
        geom_segment(x = FPR0, y = 0, xend = FPR0, yend = TPR0, color = 4) +
        geom_segment(x = 0, y = 0, xend = FPR0[7], yend = 0, color = 6) +
        ggtitle("Approximating the AUC of Logistic Model") +
        xlab("1-specificity (FPR)") + 
        ylab("Sensitivity (TPR)") +
        annotate("text", x = 0.025, y = 0.125, label= "A") + 
        annotate("text", x = 0.105, y = 0.5, label = "B") +
        annotate("text", x = 0.605, y = 0.5, label = "C") +
        theme(plot.title = element_text(hjust = 0.5),
              legend.position = c(0.8, 0.2),
              plot.margin = unit(c(0.15, 0.15, 0.75, 0.15), "inches"),
              axis.line = element_line(size = 2, colour = "navy", linetype=1))
# partition the region under the ROC into trapezoids
ggplotly(ggROC)

AUC Value

A<-(0.041*1)/2
B<- ((0.504-0.041)*1)
C<- ((1-0.504)*1)

AUC<- A+B+C
kable(AUC, caption= "Area under the Curve")
Area under the Curve
x
0.9795

Conclusion for Model Fit

Our best fit model is : \[ \text{ Diabetes} = 9.99 - 0.27\times \text{gender} - 1.03\times \text{age} -0.8\times \text{hypertension} - 0.76\times \text{heart disease} -10.24\times \text{bmi} - 2.49\times \text{HbA1c level} - -10.3\times \text{blood glucose level} \]

The best fit model was selected based on a variety of different features. Both stepwise selection and random forest selected the model as best fit. After running performance analyses on the model, the model was shown to be of high performance. This means that the variables chosen for the final model are significant for predicting diabetes. The model chose not to select smoking history as significant predictor for the response variable. A possible suggestion for this would be to reduce the number of response values to a binary response. Such as, “Have you ever smoked?” and have the response be “Yes” or “No”. This could help in predicting the relationship between smoking and diabetes.

---
title: "Applied Machine Learning for Data Analysis"
author: "Gabriella Khalil"
date: "2025-02-18"
output: 
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: no
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---


```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}
if (!require("GGally")) {
   install.packages("GGally")
library(GGally)
}
knitr::opts_chunk$set(echo = TRUE,   # include code chunk in the output file
                      warning = FALSE,# sometimes, you code may produce warning messages,
                                      # you can choose to include the warning messages in
                                      # the output file. 
                      results = TRUE, # you can also decide whether to include the output
                                      # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
library(tidyverse)
library(VIM) 
library(dplyr)
library(Hmisc)
library(ggplot2)
library(plotly)
library(DescTools)
library(cowplot)
library(mice)
library(moments)
library(tidyr)
library(reshape2)  # For data reshaping
library(caret)
library(pROC)
library(randomForest)
library(fastDummies)
```
# Overview of Data

The purpose of this dataset was to help find predictors of diabetes. The goal is to help medical professionals identify patients with potential risk factors of diabetes.The dataset contains information on patient demographics such as age and gender, as well as medical information including blood glucose levels and hypertension. The observations in this dataset were obtained from research studies and healthcare institutions. The dataset was obtained via Kaggle. 

Mustafa, T. Z. (2021). *Diabetes Prediction Dataset*. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

The dataset consists of 8 predictor variables for diabetes:
'gender' the patient's gender as male or female.
'age' the age of the patient in years.
'hypertension' yes or no, on whether the patient has hypertension.
'heart disease' yes or no, on whether the patient has heart disease.
'smoking history' the patient's smoking history indicated as never, no info, current, former or not current.
'bmi' the patient's body mass index (kg/m**2).
'HbA1c level' the patient's average blood sugar level over the past 2-3 months (%).
'blood glucose level' amount of blood sugar in the patient's blood (mg/dL).



```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot of shows the frequency of patients with diabetes and hypertension. The plot indiactes a higher frequency of patients not having diabetes and not having hypertension. There does not seem to be an interaction between having diabetes and having hypertension based on the low frequency of patients having both. '}
#load in dataset
data <-read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')
diabetesd<-data

#change heart disease, diabetes and hypertension to 1 and 0 for easier use in MICE.
data$diabetes <- recode(data$diabetes, "1"="Yes", "0"="No")
data$hypertension <- recode(data$hypertension, "1"="Yes", "0"="No")
data$heart_disease <- recode(data$heart_disease, "1"="Yes", "0"="No")

#plot interaction between diabetes and hypertension.
ggplot(data, aes(x = hypertension, fill = diabetes)) +
  geom_bar(position = "dodge") + # Use "stack" for a stacked bar chart
  labs (title="Risk of Diabetes by Hypertension", x="Hypertension", y= "Frequency",
        fill= "Diabetes")

```

# Comparison of Two Categorical Variables

The purpose of this graph is to show the relationship between having hypertension and diabetes. For this graph, I changed the values of 1 in both the diabetes and hypertension variable to "Yes" and the value 0 to "No." These integers were meant to represent categorical variables with 1 correlating to being positive for hypertension or diabetes and 0 correlating to being negative to hypertension or diabetes. Based on our graph, we can see that there is no correlation between having diabetes and having hypertension, with most patients within our sample having neither diabetes or hypertension. Thus, hypertension would not be a good predictor of diabetes within patients.
```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot shows the interaction between blood glucose levels and HbA1c levels on diabetes. Based on the plot, higher blood glucose levels and HbA1c levels are positively correlated with the presence of diabetes. In comparison, lower levels, indicate that a patient will more commonly not have diabetes. The height of the points shown for diabetes represent the levels of HbA1c, based on the plot, HbA1c levels predict the presence of diabetes more often than blood glucose levels. '}


interact = ggplot(data, aes(x =blood_glucose_level , y = HbA1c_level, group = diabetes, color = diabetes)) +
  geom_line(size = 1) +  # Lines representing interaction
  geom_point(size = 2) + # Points for data
  labs(
    title = "Interaction of Blood Glucose and HbA1c Levels",
    x = "Blood Glucose Level",
    y = "HbA1c Level",
    color = "Presence of Diabetes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text = element_text(size = 12),
    legend.position = "top"
  )
interact

```
# Comparison of Two Numerical Variables

The two variables being compared here are Blood Glucose Level and HbA1c Level. HbA1cc is a shortened term for hemoglobin A1c. Hb1Ac level is a measure of average blood sugar levels over 2-3 months and is measured in %. Blood glucose level is the amount of blood sugar in a patients blood at a given time. Our plot shows a correlation between high levels in HbA1c and blood glucose levels and it's relationship with diabetes. This plot suggests patients with high levels of HbA1c and blood glucose levels are at high risk for diabetes.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The boxplot shows the relationship between age and presence of diabetes. The boxplot shows that older patients on average are more often have diabetes. There are a few outliers in the diabetes population that appear to be under the age of 20 years. This could be representative of patients with Type I diabetes, since the data does not distinguish between Type I and Type II diabetes. '}

#plot of a numerical and categorical variable
 boxplot(data$age ~ data$diabetes,
         xlab="Patient Age",
         ylab="Presence of Diabetes",
         col = c("skyblue", "purple"),
         main="Visualization of Diabetes by Age",
         cex.main = 1.1,
         col.main = "navy")
```

# Comparison of a Numerical and Categorical Variables

This chart shows the relationship between age and diabetes diagnosis. This chart shows that as patients grow older they are at higher risk for diabetes. This would make age a predictor for diabetes, with the average age for those without diabetes being 40 years and 60 years for those with diabetes. However, we can see that we do have a few outliers for those with diabetes with some patients being diagnosed at 20 or younger. This could be from the population of people with Type I diabetes which is usually diagnosed in patients under 20. Our dataset does not distinguish between type I and type II diabetes.


# Conclusion for Comparison of Variables

Diabetes can be predicted using age, blood glucose levels, and HbA1c levels in patients. However, our results showed hypertension is not a good predictor of diabetes. Based on our findings, it would be interesting to look at factors that occur once a patient becomes older that could lead to diabetes at an older age. As well, diet could be another factor to look into, such as grams of sugar consumed on a daily or weekly basis, to see if sugar consumption can predict diabetes. Another important thing to analyze in future studies would be Type I versus Type II diabetes, as it could be the reason for some of the outliers in the boxplot. 


## Overview of Missing Variables

An important step in data analysis for machine learning is handling missing values. Missing values in a dataset can occur for many reasons, but not limited to non-response in surveys, participants leaving the study, data entry errors and system limitations. If missing values are not addressed, models can become biased and accuracy of the anaylysis can decrease. This section examines different imputation strategies to effectively handle missing values.

The imputation methods we will examine include:

Replacement Imputation for Categorical Features: Mode imputation and KNN-based imputation.

Regression-based Imputation for Numerical Features: Predictive modeling to estimate missing values.

Multiple Imputation: Advanced methods such as MICE to improve robustness.

```{r}
# create random observation ID and replace the corresponding obs with missing
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE) 
diabetesd$bmi[ltr.missing.id] <- NA
diabetesd$heart_disease[gpa.missing.id] <- NA
diabetesd$smoking_history[diabetesd$smoking_history == "No Info"] <- NA

```

# Mode Imputation for Categorical Variables

In this method, missing variables are replaced with the most frequent category of the corresponding variable. Mode imputation ensures consistency and works well when missing values are randomly distributed. Below, mode imputation is used on the variable, heart disease, which is 1 if the patient has heart disease and 0 if the patient does not have heart disease.The variable smoking history had current, former, no info, never, and ever as possible values. No Info indicates that we do not have information on the participant's smoking history and will be treated as a missing values.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot shows the comparison between the original data containing missing values for the variables heart disease and smoking history, and the data after using mode impute. The histograms show that mode impute used what was most common to replace the missing values. This is seen clearly in the variable smoking history, where "never" went from originally a count of about 40,000 to about 70,000. Heart disease had a small amount of missing data (only 15), however the 15 missing were changed to 0, indicating the patient would not have heart disease.'}


diabetesd <-diabetesd
# Function to impute mode

Mode <- function(x) {
  ux <- unique(na.omit(x))  # Remove NAs before computing mode
  tab <- tabulate(match(x, ux))
  mode_value <- ux[which.max(tab)]
  
  # Ensure mode_value is of the same type as x
  if (is.factor(x)) {
    return(factor(mode_value, levels = levels(x)))
  } else if (is.character(x)) {
    return(as.character(mode_value))
  } else {
    return(as.numeric(mode_value))
  }
}
#apply mode imputation
value_imputed <- data.frame(
  original = diabetesd$heart_disease,
  original2=diabetesd$smoking_history,
  imputed_modehd = replace(diabetesd$heart_disease, is.na(diabetesd$heart_disease), Mode((diabetesd$heart_disease))),
  imputed_modesh = replace(diabetesd$smoking_history, is.na(diabetesd$smoking_history), Mode(diabetesd$smoking_history)))
summary(value_imputed)

#plot the comparison of original data and mode imputed data
h1 <- ggplot(value_imputed, aes(x = original)) +
  geom_histogram(bins=10, fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original for Heart Disease") +
  theme_classic()
h2 <- ggplot(value_imputed, aes(x = original2)) +
  geom_bar( fill = "#15ad4f", color = "#000000", position = "identity") +
  ggtitle("Original for Smoking History") +
  theme_classic()
h3 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
  geom_histogram(bins=10, fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed for Heart Disease") +
  theme_classic()
h4 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
  geom_bar( fill = "#ad8415", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed for Smoking History") +
  theme_classic()


plot_grid(h1, h2, h3, h4, nrow = 2, ncol = 2)



```
# Regression-Based Imputation for Numerical Variables

Using regression models, we can estimate missing values. For the variable, bmi, we can predict the missing values using estimates from our model. Bmi is a continuous variable, unlike the previous example where our values could only be 0 or 1. In this method, a linear model was created to help create estimates of our values. 

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The boxplots show the comparison of the variable bmi before and after regression imputation.' }

#create a dataset for regression impute
regimpute<-diabetesd

# Identify rows where BMI is missing
missing_rows <- which(is.na(diabetesd$bmi))  # Get row indices


# Train a linear model using complete cases
lm_model <- lm(bmi ~ age + blood_glucose_level + HbA1c_level + heart_disease + hypertension + diabetes, 
               data = regimpute, na.action = na.exclude)

# Impute missing BMI values using the model
diabetesd$bmi[missing_rows] <- predict(lm_model, newdata = regimpute[missing_rows, ])


dep.mi <- bind_rows(
  data.frame(value = diabetesd$bmi, imputed = regimpute$bmi))
             #imp = "dep.imp1")


i7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
i8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)


 

```

# MICE Imputation

MICE is multiple imputation techniques used to impute missing values. This method also uses regression models and accounts for uncertainty by generating multiple possible values. Using MICE, the variables smoking history, heart disease, and bmi, can be imputated within one function. MICE can handle different types of variables, imputates based on the relationship between variables and it accounts for variability in missing data. 
```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='Based on the three graphs, MICE seems to have produced values that appear almost identical to the original dataset. Mode impute seems to be slightly different since it relies on filling missing values with the most common value. In this case, mode impute filled all missing values as 0 and under-represented 1.'}

#reload dataset so that it is free of mode and regression imputation done previously
data2 <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')

#create missing values and make smoking history a binary variable
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE) 
data2$smoking_history[data2$smoking_history== "No Info"] <-NA
data2$bmi[ltr.missing.id] <- NA
data2$heart_disease[gpa.missing.id] <- NA

#make variables numeric
data2$gender <- as.numeric(as.factor(data2$gender))
data2$heart_disease<- as.factor(as.numeric(data2$heart_disease))
data2$smoking_history<-as.factor(data2$smoking_history)
#apply MICE imputation
df_mice <- mice(data2, method = c("","","","pmm","polyreg","pmm","","",""),  m = 20, maxit=20,seed = 123, print=F)

# Complete dataset with imputed values
df_imputed <- complete(df_mice)

#plot comparisons of before and after MICE imputation for heart disease
g1 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
  geom_bar(fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed") +
  labs(x="Heart Disease")+
  theme_classic()
g2 <- ggplot(value_imputed, aes(x = original)) +
  geom_bar(fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original") +
  labs(x="Heart Disease")+
  theme_classic()
g3 <- ggplot(df_imputed, aes(x = heart_disease)) +
  geom_bar(fill = "purple", color = "#000000", position = "identity") +
  ggtitle("MICE") +
  labs(x="Heart Disease")+
  theme_classic()
plot_grid(g1, g2, g3, nrow=1, ncol =3)
```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The bar graphs show the comparison of the original data to mode imputation and MICE. MICE appears to have dispersed the data more evenly across the different categories, while mode imputation has only added values to "never.'}
#plot comparisons for smoking history
g4 <- ggplot(df_imputed, aes(x = smoking_history)) +
  geom_bar( fill = "purple", color = "#000000", position = "identity") +
  ggtitle("MICE") +
  labs(x="Smoking History")+
  theme_classic()
g5 <- ggplot(value_imputed, aes(x = original2)) +
  geom_bar( fill = "red", color = "#000000", position = "identity") +
  ggtitle("Original Smoking History") +
  labs(x="Smoking History")+
  theme_classic()
g6 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
  geom_bar( fill = "blue", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed") +
  labs(x="Smoking History")+
  theme_classic()
plot_grid(g4,g5,g6, nrow=3, ncol=1)
```
```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The boxplots show the comparison of the original data to both MICE and regression imputation.'}

#plot comparisons for bmi
g7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g9 <-boxplot(df_imputed$bmi,
main = "MICE for BMI",
xlab = "BMI",
col = "purple",
border = "black",
horizontal = TRUE,
notch = TRUE
)



```

```{r fig.align='center', fig.width=7, fig.height=5}
model5 <- with(df_mice, lm(diabetes ~ bmi+ heart_disease+ smoking_history ))
summary(pool(model5))
```
# Skewness

Skewness is a measure of symmetry within a distribution. The skewness measurements below are of the variables age, bmi, HbA1c level, and blood glucose level, respectively. When skewness is equal to 0, we can assume the data is normally distributed. When skewness is greater than 0, our data is positively skewed, while a value less than zero indicates negatively skewed data. For the diabetes dataset, the variables bmi and blood glucose level are 1 or about 1, which indicate a significant right skew in the distribution. 
```{r}



age1<- skewness(df_imputed$age) 
bmi1<-skewness(df_imputed$bmi) 
HbA1c1<-skewness( df_imputed$HbA1c_level) 
glucose1<-skewness(df_imputed$blood_glucose_level)

mytable<-data.frame(age1,bmi1,HbA1c1, glucose1)
kable(mytable, caption= "Skewness of Numeric Variables")

```
# Transforming Data

To handle our skewed variables, we can normalize the variables bmi and blood glucose level. Normalization rescales data into a fixed range of [0,1]. By normalizing our data, we can fix skewness and create normally distributed data. The variables, age and HbA1c level have been standardized to keep the distribution centered. Standardization is used to create a mean of 0 and a standard deviation of 1, to follow a normal distribution. Based on the results below, we can see that the transformation has corrected the right skew in the data from before.


```{r}
df_trans <- df_imputed  # Copy original dataset after MICE

#ensure diabetes is numeric and a factor
df_trans$diabetes<- as.numeric(as.factor(df_trans$diabetes))
# Min-Max Normalization
normalize <- function(x) {
  if (max(x) - min(x) == 0) return(rep(0, length(x)))  # Prevent division by zero
  return((x - min(x)) / (max(x) - min(x)))
}

#apply normalization
df_trans$bmi <- normalize(df_imputed$bmi)
df_trans$blood_glucose_level <- normalize(df_imputed$blood_glucose_level)

# Z-Score Standardization
standardize <- function(x) {
  return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}

#apply standardization
df_trans$age <- standardize(df_trans$age)
df_trans$HbA1c_level <- standardize(df_trans$HbA1c_level)

# Apply Log Transformation to Fix Skewness (if necessary)
df_trans$bmi <- log1p(df_trans$bmi)  # log(1 + x) avoids log(0) issues
df_trans$blood_glucose_level <- log1p(df_trans$blood_glucose_level)

bmi2 <- skewness(df_trans$bmi)
age2 <- skewness(df_trans$age)
HbA1c2<- skewness(df_trans$HbA1c_level)
glucose2<- skewness(df_trans$blood_glucose_level)

newtable<-data.frame(bmi2,age2,HbA1c2,glucose2)
kable(newtable, caption="Newly Transformed Data")


```
Transforming our data has successfully decreased skewness in our variables. We can see bmi and blood glucose level are no longer about 1. 

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='We can see in the graphs above that after transforming the dataset, the data seems to follow a normal distribution more closely.This is apparent in the histograms appearing more in a bell shape curve.' }

# Combine old and new data for comparison
df_compare <- data.frame(
  bmi_original = df_imputed$bmi,
  bmi_transformed = df_trans$bmi,
  blood_glucose_original = df_imputed$blood_glucose_level,
  blood_glucose_transformed = df_trans$blood_glucose_level
)

# Reshape data for ggplot
df_long <- melt(df_compare)

# Plot histograms
ggplot(df_long, aes(x = value, fill = variable)) +
  geom_histogram(alpha = 0.6, bins = 30, position = "identity") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(title = "Comparison of Original vs. Transformed Data", x = "Value", y = "Count")


```




# Analyzing Data
The next step is to find which variables are significantly important when trying to predict diabetes. By using feature selection, we are able to predict possible best fit models for our variable, diabetes. For this data, we will focus on logistic regression. The variables of interest, diabetes, is a binary variable indicating linear regression would not make an ideal model. 



**Random Forest** 
First, we will look at the best model selected by feature selection through random forest:
```{r fig.align='center', fig.width=7, fig.height=5,fig.cap='The table shows the selected variables random forest has selected to use for our model.'}

# Prepare the data
df_trans$diabetes <- as.factor(df_trans$diabetes)


# Set up RFE control with random forest
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# Apply RFE to identify top features
results <- rfe(
  x = df_trans[, !colnames(df_trans) %in% "diabetes"],
  y = df_trans$diabetes,
  sizes = c(1:9),
  rfeControl = control
)

# Print the selected features
models<-data.frame(predictors(results))
kable(models, caption="Random Forest Selected Model")


```
**Stepwise Logistic Regression**
Next, let's look and stepwise logistic regression

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The coefficients shown below are the ones stepwise logistic regression has chosen for the final model'}
#Stepwise Logistic Regression
# Fit a full logistic regression model with all predictors
full_model <- glm(diabetes ~ ., data = df_trans, family = binomial)

# Perform stepwise selection (default is backward elimination)
step_model1 <- step(full_model, direction = "both")
summary(step_model1)
```

Both selection types selected the same model! However I will still go through the steps of cross-validation as though I was comparing the two models.

# Cross-Validation

Let's explore cross-validation to see which model is best fit. For logistic regression, it is meaningful to look at accuracy and kappa. 
```{r fig.align='center', fig.width=7, fig.height=5}

#add classification
df_trans$diabetes<- factor(ifelse(df_trans$diabetes==1, "Yes","No"))

train_control <- trainControl(
  method = "cv",         # cross-validation
  number = 10,           # 10-fold
  classProbs = TRUE # use if you're tracking AUC, Sensitivity, etc.
)

#run CV model
cv_model <- train(
  diabetes ~ hypertension + gender +age + HbA1c_level + blood_glucose_level + heart_disease + bmi, 
  data=df_trans, 
  method = "glm", 
  family = "binomial", 
  trControl = train_control
  )

print(cv_model)

```
Accuracy shows the overall correctness of the model predictions and kappa measures the levels of agreement between variables. Values closer to 1 suggest higher accuracy and better agreement. This model suggests high accuracy and high agreement based on the values.

# ROC Curve and AOC Curve

**Confusion Matrices**
ROC is also known as Receiver Operating Characteristic Analysis. It is a technique using graphs that evaluates the performance of a binary model. Before plotting the ROC curve, it is necessary to calculate the true positive rate (TPR) and false positive rate (FPR). Below, are confusion matrices that were used to calculate those values.
```{r fig.align='center', fig.width=7, fig.height=5,'These matrices are used to calculate TPR and FPR. These values are used to create the ROC curve.'}

#ROC and AUC
df_trans$diabetes <- as.factor(df_trans$diabetes)

# fit a logistic
model.logit <- glm(diabetes ~ age + gender+ bmi+hypertension + heart_disease + blood_glucose_level+ HbA1c_level , family = binomial, data = df_trans)
# predict probability of P(Y = "Yes")
probabilities <- round(as.vector(predict(model.logit, type = "response")),3)
#
thresholds <- c(0.0, 0.25, 0.5, 0.75, 1.0)

# Loop through thresholds and create confusion matrices
for (threshold in thresholds) {
  cat("\nConfusion Matrix for Threshold =", threshold, "\n")
  
  # Convert probabilities to predictions
  # am: 1 = manual transmission, 0 = automatic transmission
  predictions <- ifelse(probabilities > threshold, "Yes", "No")
    all_levels <- union(levels(factor(df_trans$diabetes)), levels(factor(predictions)))
    # Generate confusion matrix
  cm <- confusionMatrix(factor(predictions, levels=all_levels), factor(df_trans$diabetes,levels=all_levels), positive = "Yes")
  print(cm$table)
}

```
**ROC Curve**
```{r fig.align='center', fig.width=7, fig.height=5,'The plot shows the path of the ROC curve.'}

#create ROC curve

TPR = c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0))
FPR = c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0))
plot(FPR, TPR, type = "b", main = "An Illustrative ROC Curve", col ="blue",
     xlab="1 - Specifity (FPR)", ylab = "Sensitivity (TPR)")
# add a off-diagonal representing random guess algorithm in binary prediction
abline(0,1, lty = 2, col = "red")
# legend
legend("bottomright", c("Logistic Model", "Random Guess"),
       col=c("blue", "red"), lty = 1:2, bty="n", cex = 0.9)
```
**AUC (Area under the Curve)**
AUC quantifies the performance of the ROC curve into a single value that ranges from 0 to 1. The plot below uses the Riemann Sum approximation to estimate the area under the curve.
```{r fig.align='center', fig.width=7, fig.height=5,'The plot estimates the area of the ROC Curve. The region is divided into subregions, A,B, and C. These values are used to calculate the area under the curve.' }
#AUC
TPR = round(c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0)),3)
FPR = round(c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0)),3)
TPR0 = TPR[7:1]
FPR0 = FPR[7:1]
#plot(FPR0, TPR0, type = "b")
datSenSpe = data.frame(TPR0, FPR0)
ggROC = ggplot(data = datSenSpe, aes(x = FPR0, y=TPR0)) +
        geom_line(col = "steelblue") +
        geom_point(col = "red") +
        geom_segment(x = FPR0, y = 0, xend = FPR0, yend = TPR0, color = 4) +
        geom_segment(x = 0, y = 0, xend = FPR0[7], yend = 0, color = 6) +
        ggtitle("Approximating the AUC of Logistic Model") +
        xlab("1-specificity (FPR)") + 
        ylab("Sensitivity (TPR)") +
        annotate("text", x = 0.025, y = 0.125, label= "A") + 
        annotate("text", x = 0.105, y = 0.5, label = "B") +
        annotate("text", x = 0.605, y = 0.5, label = "C") +
        theme(plot.title = element_text(hjust = 0.5),
              legend.position = c(0.8, 0.2),
              plot.margin = unit(c(0.15, 0.15, 0.75, 0.15), "inches"),
              axis.line = element_line(size = 2, colour = "navy", linetype=1))
# partition the region under the ROC into trapezoids
ggplotly(ggROC)
```

**AUC Value**
```{r fig.align='center', fig.width=7, fig.height=5,'The area under the curve is 0.9795 which is close to 1, this indicates an ideal model.'}
A<-(0.041*1)/2
B<- ((0.504-0.041)*1)
C<- ((1-0.504)*1)

AUC<- A+B+C
kable(AUC, caption= "Area under the Curve")

```
# Conclusion for Model Fit
Our best fit model is :
$$
\text{ Diabetes} = 9.99 - 0.27\times \text{gender} - 1.03\times \text{age}  -0.8\times \text{hypertension} - 0.76\times \text{heart disease} -10.24\times \text{bmi} - 2.49\times \text{HbA1c level} - -10.3\times \text{blood glucose level}
$$


The best fit model was selected based on a variety of different features. Both stepwise selection and random forest selected the model as best fit. After running performance analyses on the model, the model was shown to be of high performance. This means that the variables chosen for the final model are significant for predicting diabetes. The model chose not to select smoking history as significant predictor for the response variable. A possible suggestion for this would be to reduce the number of response values to a binary response. Such as, "Have you ever smoked?" and have the response be "Yes" or "No". This could help in predicting the relationship between smoking and diabetes. 
